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Summary. Measuring conditional dependence is an important topic in statistics with broad applications 
including graphical models. Under a factor model setting, a new conditional dependence measure based on 
projection is proposed. The corresponding conditional independence test is developed with the asymptotic 
null distribution unveiled where the number of factors could be high-dimensional. It is also shown that the new 
test has control over the asymptotic significance level and can be calculated efficiently. A generic method 
for building dependency graphs without Gaussian assumption using the new test is elaborated. Numerical 
results and real data analysis show the superiority of the new method. 
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1. Introduction 


Undirected graphical model is an important tool to capture dependence among random variables and 
has drawn tremendous attention in various fields including signal processing, bioinformatics and network 


modeling (Wainwright and Jordan, 2008). Let z = ( z ( - 1 ), ..., z^) be a d-dimensional random vector. We 


denote the undirected graph corresponding to z by (V,E), where vertices V correspond to components 
of z and edges E = {e^-, 1 < i,j < d} indicate whether node and are conditionally independent 
given the remaining nodes. In particular, the edge is absent if and only if z^ i z^ |z \ 

When z follows multivariate Gaussian distribution with mean p, and covariance matrix S, the precision 
matrix fi = (wij)dxd = S _1 captures exactly this relationship; that is, Wij = 0 if and only if eij is absent 


(Lauritzen, 1996 Edwards, 2000). Therefore, under the Gaussian assumption, this problem reduces to 


the estimation of precision matrix, where a rich literature on model selection and parameter estimation 


can be found in both low-dimensional and high-dimensional settings, including Dempster (1972), Drton 


and Perlman (2004), Meinshausen and Biihlmann (2006), Friedman et al. (2008), Fan et al. (2009), and 


Cai et al. (2011). While Gaussian graphical model (GGM) can be useful, the stringent requirement on 


normality is not always satisfied in real application where the observed data usually have fat tails or are 


skewed (Xue and Zou, 2012). 


To relax the Gaussian assumption, Liu et al. (2009) proposed the nonparanormal model, where they 
find transformations that marginally gaussianize the data and then work under the Gaussian graphical 
model framework to estimate the network structure. Under the nonparanormal model, |Xue and Zou| 
[(2bl2j ) proposed rank-based estimators to approximate the precision matrix. The nonparanornral model, 
although flexible, still assume the transformed data follows a multivariate Gaussian distribution, which 
can also be restrictive at times. Instead of using these nonparametric methods to find transformations and 
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work under GGM, we would like to propose a more natural way of constructing graphs. That is, we work 
directly on the conditional dependence structure by introducing a measure for conditional dependence 
between node i and j given the remaining nodes. Then, we can introduce a hypothesis testing procedure 
to decide whether the edge e^- is present or not. 

It is worth noting that, based on hypothesis testing, we could indeed build a general conditional 
dependency graph, where presence of an edge e,; 7 between nodes z^\ z^> represents that the two nodes 
are dependent conditional on some factors f. Graphical model is one type of such graphs where f is 
chosen to be z \ {z^ l \ z^}', in such cases, we call f internal factors. More generally, f could contain 
covariates we observe or latent factors that we do not observe, which are not necessarily part of z, and 
we call them external factors. As an example, in Fama-French three-factor model, the return of each 
stock can be considered as one node and f are the chosen three-factors. This example will be further 
elaborated in Section [5j Another interesting application is discussed in Stock and Watson (20021, where 
external factors are aggregated macroeconomic variables, and the nodes are disaggregated macroeconomic 
variables. 

Back to hypothesis testing, in economics, there has been abundant literature on different conditional 


independence tests. Linton and Gozalo (1997) proposed two nonparametric tests of conditional indepen¬ 
dence based on a generalization of the empirical distribution function; however, a complicated bootstrap 
procedure is needed to calculate critical values of the test, which leads to limited practical value. |Su and| 


White (2007. 2008 20141 proposed conditional independence tests based on Hellinger distance, condi¬ 


tional characteristic function and empirical likelihood, respectively. However, all those tests either have 
tuning parameters or are computationally expensive. 

As an attempt to propose a remedy for the above issues, we consider the following setup. In particular, 
suppose {(xj, y,;, f,, ei tX , £i, y ), i = 1 ,...,n} are i.i.d. realizations of (x, y, f, e x , e y ), which are generated 
from the following model: 


X - Gj;(f) -F Ca-, y — G y (f) + e y , (1) 

where f is the A'-dimensional common factors, G x and G y are general mappings from R A to and 
R 9 , respectively. Note that we only observe {(x^ y i} £,), i = 1Here, we assume independence 
between (e x , e y ) and f. In addition, the dimensions p and q are assumed to be fixed while the number of 
factors K could diverge to infinity. 

Our goal is to test whether x and y are independent given f, i.e., 


H 0 : x X y|f• (2) 

Under model 0, the testing problem is equivalent to test whether e x and e y are independent, i.e., 

H 0 : e x X e y . (3) 

Note that in the case of building graphical models without Gaussian assumption, we could use 0 by 
setting x and y as the random vectors associated with a pair of nodes in the graph and f represents 
the rest of the nodes. Then, the method to be developed could be used to construct a high-dimensional 
undirected graphs by conducting the test in 0 for each pair of the nodes. This gives a graphical summary 
of the conditional dependence structure. 

Since {(ej iX , e i,y) , * = 1,..., n} are hidden, a natural idea is to estimate them by the residuals after a 
projection of x and y onto f. Asymptotically, a fully nonparametric projection on f (e.g., local polynomial 











A Projection Based Conditional Dependence Measure 3 

regression) would consistently recover the random errors when K is fixed along with certain smoothness 
assumptions on G x and G y . However, it becomes challenging when K diverges due to the curse of 
dimensionality if no structural assumptions are made on G x and G y . As a result, we will study the case 
where G x and G y are linear functions (factor models) in Section 


2.2 


and the case where G x and G y are 


additive functions in Section 2.5 when K diverges. 

To complete our proposal, we need to find a suitable measure of dependence between random vari¬ 
ables/vectors. In this regards, many different measures of dependence have been proposed. Some of them 
rely heavily on Gaussian assumptions, such as Pearson correlation, which measures linear dependence 
and the uncorrelatedness is equivalent to independence only when the joint distribution is Gaussian; or 


Wilks Lambda (Wilks, 1935), where normality is adopted to calculate the likelihood ratio. To deal with 
non-linear dependence and non-Gaussian distribution, statisticians have proposed rank-based correla¬ 
tion measures, including Spearman’s p and Kendall’s r, which are more robust than Pearson correlation 
against deviations from normality. However, these correlation measures are usually only effective for 
monotone types of dependence. In addition, under the null hypothesis that two variables are inde¬ 
pendent, no general statistical distribution of the coefficients associated with these measures has been 


derived. Other related works include Hoeffding (1948), Blomqvist (1950), Blum et al. (1961), and some 
methods described in Hollander et al. (2013) and Anderson (19621. Taking these into consideration, 
distance covariance (Szekely et al., 2007) was introduced to address all these deficiencies. The major 
benefits of distance covariance are: first, zero distance covariance implies independence, and hence it is 
a true dependence measure. Second, distance covariance can measure the dependence between any two 
vectors which potentially are of different dimensions. Due to these advantages, we will focus on distance 
covariance in this paper as our measure of dependence. 


The main contribution of this paper is two-fold. First, under the factor model assumption, we propose 
a computationally efficient conditional independence test. Both the response vectors and the common 
factors can be of different dimensions and the number of the factors could grow to infinity with sample size. 
Second, we apply this test to build conditional dependency graph and covariates-adjusted dependency 
graph, as generalizations of the Gaussian graphical model. 

The rest of this paper is organized as follows. In Section [2j we present our new procedure for testing 
conditional independence via projected distance covariance (P-DCov) and describe how to construct 
conditional dependency graphs based on the proposed test. Sect ion [3] gives theoretical properties including 
the asymptotic distribution of the test statistic under the null hypothesis as well as the type I error 
guarantee. Section [4] contains extensive numerical studies and Section [5] demonstrates the performance 
of P-DCov via two real data sets. We conclude the paper with a short discussion in Section [6] Several 
technical lemmas and all proofs are relegated to the appendix. 


2. Methods 

First, we introduce some notations. For a random vector z, ||z|| and ||z||i represent its Euclidean norm 
and t\ norm, respectively. A collection of n i.i.d. observations of z is denoted as {z 1 ,...,z n }, where 
Zfc = {z ^\..., z < jf > ) T represents the fc-th observation. For any matrix M, ||M||.f, ||M|| and ||M|| max 
denote its Frobenius norm, operator norm and max norm, respectively. ||M|| ai j, is the (a, b) norm defined 
as the th norm of the vector consisting of column-wise £ a norm of M. 













4 J. Fan, Y. Feng , and L. Xia 

2.1. A brief review of distance covariance 

As an important tool, distance covariance is briefly reviewed in this section with further details available 
in Szekely et dl. (2007). We introduce several definitions as follows. 


Definition 1. (w-weighted L 2 norm) Let Cd = Y((d+\)/ 2 ) > / or an 2/ P os fff ve integer d, where T is the 
Gamm.a function. Then for function 7 defined on x M 9 , the w-weighted L 2 norm of 7 is defined by 


= [C v C a \\T\\ 1 +P \\p\\ 1+q ) K 


\\t{ t ,p)\\w= I h ( r , p )\ 2 w ( r 7 p ) dTdp , where w ( r , p) = (c 
Jg.P+1 

Definition 2. (Distance covariance) The distance covariance between random vectors x £ and 


y e 


with finite first moments is the nonnegative number V(x, y) defined by 


V 2 ( x ,y) = \\9x, y (r,p) - g x (r)g y {p)\\ 


2 

W 1 


where g x , g y and g XtV represent the characteristic functions of x, y and the joint characteristic function 
of x and y, respectively. 

Suppose we observe random sample {(x^, y^) : k = 1, ..., n} from the joint distribution of (x, y). We 
denote X = (xi,x 2 ,... ,x n ) and Y = (yi,y 2 , -.. ,y„). 

Definition 3. (Empirical distance covariance) The empirical distance covariance between samples 
X and Y is the nonnegative random variable V„(X, Y) defined by 

V 2 (X, Y) = Si(X, Y) + S 2 (X, Y) - 25 3 (X, Y), 


where 


Si(x,y) = 4 it, ll x fc — x zlll|yfc — y«IU s 2 (x,y) = ||x fc -xj||-^ J2 lly* — y*l 


k,l=l 
n n 


k,l =1 


k.l =1 


5 , 3(x,y) = — ^ ^ Il x /c - x z||||yfc-y m ||- 


k —1 l,m= 1 


With above definitions, Lemma |T] depicts the consistency of V„(X,Y) as an estimator of V(x,y). 
Lemma [ 2 ] shows the asymptotic distribution of V„(X,Y) under the null hypothesis that x and y are 
independent. Corollary [l] reveals properties of the test statistic nV 2 /^ proposed in Szekely et al. (2007). 


Lemma 1. (Theorem 2 in Szekely et al. (2001)) Assume that E(||x|| + ||y||) < 00, then almost surely 


hm V„(X,Y)=V(x,y). 


Lemma 2. (Theorem 5 in Szekely et al. (200ity) Assume that x and y are independent, and E(||x|| + 
||y||) < 00 , then as n —> 00 , 

nV 2 (X,Y)4||C(r,p)|| 2 , 


D 


where —> represents convergence in distribution and £(•,•) denotes a complex-valued centered Gaussian 
random process with covariance function 


R( u,u 0 ) = (g x {r - t 0 ) - g x (T)g x {T 0 ))(g y (p - p 0 ) - g y (p)g y (p Q )), 


in which u = (r, p), Uq = (tq, p 0 ). 
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Corollary 1. (Corollary 2 in Szekely et al. (2001)) Assume that E(||xj| + ||y||) < oo. 


(i) If x and y are independent, then as n —¥ oo, nV^(X, Yj/S^ —> Q with Q = where 

Zj l ~ d 7V(0,1) and {Aj} are non-negative constants depending on the distribution of (x, y); E(Q) = 

1 . 

(ii) If x and y are dependent, then as n —¥ oo, nV^(X, Yj/S^ oo. 


2.2. Conditional independence test via projected distance covariance (P-DCov) 

Here, we consider the case where G a , and G y are linear in (JT|) , which leads to the following factor model 
setup: 

x = B a ,f + e x , y = Byf + e y , (4) 

where B x and B y are factor loading matrices of dimension p x K and q x K respectively, and f is the 
Jv-dimensional vector of common factors. Here, the number of common factors K could grow to infinity 
and the matrices B 7 and B y are assumed to be sparse to reflect that x and y only depend on several 
important factors. As a result, we will impose regularization on the estimation of B 7 and B y . Now, we 
are in the position to propose a test for problem <§• We first provide an estimate for the idiosyncratic 
components e x and e y , and then calculate distance covariance between the estimates. More generally, we 
project x and y onto the space orthogonal to the linear space spanned by f and evaluate the dependency 
between the projected vectors. The conditional independence test is summarized in the following steps. 
Step 1: Estimate factor loading matrices B c and B y by the penalized least square (PLS) estimators B„ 
and B y defined as follows. 


B x = argnhn * ||X - BF||f. + J2p Xl (\B jk \), 

j,k 

(5) 

B y = argnun^llY — BF||| + ^p A2 (|B yfe |), 

(6) 


j,k 


where X = (x 1; x 2 ,... ,x n ), Y = (yi, y 2 , ■ • •, y n ), F = (^^ 2 ,- • ■ ,f n ), Px(-) is the penalty function with 
penalty level A. 

Step 2 : Estimate the error vectors e.i iX and e^y by 

Ci,x — X 7 B x f", — (Bje B x )f 7 T €i x , 

€i,y y* H y fj (By T ^ 1; - ■ • 7 R* 

Step 3: Define the estimated error matrices E x = ..., e„ jX ) and E y = (ei, y ,..., e n , y )- Calculate 

the empirical distance covariance between E x and E y as 

V,2(E„E y ) = Si(E x ,E y ) + S 2 (E x ,Ey) - 25 3 (E„E y ). 

Step 4: Define the P-DCov test statistic as T(x,y,f) = nV^Ea,, E y )/5' 2 (E a; , E y ). 

Step 5: With predetermined significance level a, we reject the null hypothesis when T(x, y, f) > (^(l- 

a/2)) 2 . 

Theoretical properties of the proposed conditional independence test will be studied in Section [3] In 
the above method, we implicitly assume that the number of variables K is large so that the penalized 
least-squares methods are used. When the number of variables K is small, we can take Ai = A 2 = 0 so 
that no penalization is imposed. 
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2.3. Building graphs via conditional independence test 

Now we explore a specific application of our conditional independence test to graphical models. To 
identify the conditional independence relationship in a graphical model, i.e., z^ X z^|z\{. 2 M, we 

assume 


z i l) = Pi,ij f h + 4 ° . 44 = ffc + 44 > ft = 1 , ... ,n, 


(7) 


where f k = represents all coordinates of zother than z4 and z^ \ and /3 l i j and (3 2 ,ij are 

d — 2 dimensional regression coefficients. Under model ([7]), we decide whether edge e^- will be drawn 
through directly testing X z^>\C{ z( -t ' _J ’)), where £(f) is the linear space spanned by f. 

More specifically, for each node pair {( i,j ) : 1 < i < j < d}, we define = T{z^\ z^\ 

using the same steps as in Section [272] as the test for the current null hypothesis: 


Ho tij : e« X e«. 


( 8 ) 


We now summarize the testing results by a graph in which nodes represent variables in z and the 
edge etj between node i and node j is drawn only when is rejected at level a. 

In Q, the factors are created internally via the observations on remaining nodes z \ {z^,z^}. In 
financial applications, it is often desirable to build graphs when conditioning on external factors. In such 
cases, it is straightforward to change the factors in ([7]) to external factors. 

We will demonstrate the two different types of conditional dependency graphs via examples in Sections 
Q] and [U 


2.4. Graph estimation with FDR control 

Through the graph building process described in Section 


2.3 


we can carry out d = d(d— 1)/2 P-DCov tests 


simultaneously and we wish to control the false discovery rate (FDR) at a pre-specified level 0 < a < 1. 
Let Rp and R be the number of falsely rejected hypotheses and the number of total rejections, respectively. 
The false discovery proportion (FDP) is defined as Rpf max{l,i?} and the FDR is the expectation of 
FDP. 

In the literature, various procedures have been proposed for conducting large-scale multiple hypothesis 
testing via FDR control. In this work, we will follow the most commonly used Benjamini and Hochberg 


(BH) procedure developed in the seminal work of Benjamini and Hochberg (1995), where P-values of all 


marginal tests are compared. More specifically, let P(i) < P( 2 ) < • • • < P ^ be the ordered P-values of 


the d hypotheses given in (|8|. Let s = max{0 < i < d : Pp) < ai/d}, and we reject the s hypotheses Po,ij 
with the smallest P-values. We will demonstrate the performance of this strategy via real data examples. 


2.5. Extension to functional projection 

In the P-DCov described in Section |2.2[ we assume the conditional dependency of x and y given fac¬ 
tor f is expressed via a linear form of f. In other words, we are projecting x and y onto the space 
orthogonal to £(f) and evaluate the dependence between the projected vectors. Although this linear- 
projection assumption makes the theoretical development easier and delivers the main idea of this work, 
a natural extension is to consider a nonlinear projection. In particular, we consider the following additive 


generalization (Stone, 1985) of the factor model setup: 

x = $ (/j) + e *’ y = (/#) 

3 = 1 3 =1 




( 9 ) 
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where {gj(-), gjj (•), j = 1,..., K} are unknown vector-valued functions we would like to estimate. In 
we consider the additive space spanned by factor f. By this extension, we could identify more general 
conditional dependency structures between x and y given f. This is a special case of ([!]), but avoids the 
issue of curse of dimensionality. 


In the high-dimensional setup where K is large, we can use a penalized additive model (Ravikumar 


et al ., 2009 Fan et al ., 2011) to estimate the unknown functions. The conditional independence test 


described in Section 2.2 could be modified by replacing the linear regression with the (penalized) addi¬ 
tive model regression. We will investigate the P-DCov method coupled with the sparse additive model 


(Ravikumar et al., 2009) in numerical studies. 


3. Theoretical Results 


In this section, we derive the asymptotic properties of our conditional independence test. First, we 
introduce several assumptions on e x , e y and f. 

Condition 1. Ee x = Ee y = 0, E||e x || < oo, E||e y || < oo, E||e x || (i) 2 < oo, EUeJI 2 < oo. 

Condition 2. We denote h x as the density function of random variable x. Let us assume that the 
densities of || ei !CC — £ 2 ,nil and ||ei iV — £ 2 ,y || are bounded on [0, Co], for some positive constant Cq. In other 
words, there exists a positive constant M, 

,IK, w £ "• - "• 

Conditions^ and [2] impose mild moment and distributional assumptions on random errors e x and e y . To 
better understand when the proposed method works, we give the following high-level assumption, whose 
justifications are noted below. 


Condition 3. There exist constants C\ > 1 and 7 > 0, such that for any C 2 > 1, with probability 
greater than 1 — Cf G2 , we have for any n, 

||(B X — B X )F || 2>00 < C 2 a ra , || (B y — Bj,)F || 2j00 < C 2 a n , 
where the sequence a n = o{(n^ 1+7 - ) logn) -1 / 3 }. 

Condition 4. Let B Xi i denote the l-th row of B x , and similarly we define B Xj ;, B and B y j. We 
assume for any fixed l, 


|B Xj ; B X) i||i — Cp(e n ), ||By^ By^||i Cp(e n ), 


where sequences e n and a n in Condition]^ satisfy a n e n = o( 




' ,/n log K 


)• 


Remark 1 . Conditions [3| and 2] are mild conditions that are imposed to ensure the quality of the pro¬ 
jection and guarantee the theoretical properties regarding our conditional independence test. For example, 


one could directly call the results from penalized least squares for high-dimensional regression (Buhlmann 


and Van De Geer, 2011 ■ Hastie et al., 2015) and robust estimation \Belloni et al., 2011: Wang, 2013[ 


Fan et al., 2016b). We now discuss two special examples as follows. 


(i) (K is fixed) In this fixed dimensional case, it is straightforward to verify that the projection based 

on ordinary least squares satisfies the two conditions. 
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(ii) (Sparse Linear Projection) Let B x = (bf, bj,..., b y ) T and Bj = (bf, h) ,..., bj ) T . Note that 
the graphical model case corresponds to p = 1. We apply the popular L\-regularized least squares 
for each dimension of x regressing on the factor F. Here, we further assume the true regression 
coefficient bj is sparse for each j with Sj = {k : (bj)^ ^ 0}, Sj = {k : (b j)k ^ 0} and |S'j| = Sj. 
From Theorem 11.1, Example 11.1 and Theorem 11.3 in Hastie et al. (201!fy, and since {fJLi are 
i.i.d., we have with high probability, ||bj — bj|| < C yf S 1 ° g ^, Sj = Sj and max, ||(fi)s J . || < Sj log n. 
Then, we have with high probability, for each i = 1,..., n and j = 1,... ,p, 


||(bj - b i ) T f,|| = || (bj - b,)J ; (f,) s . | < ||(bj - b 7 ) s .||||(f i )< ? .|| < Cs max log n^J 5max ^° sK , (10) 

where s max = max., Sj. It is now easy to verify that Condition and]/] are satisfied even under the 
ultra-high-dimensional case where log K = o(n a ), 0 < a < 1/3. We would like to omit the details 
here for brevity about the specification of various constants. 


Theorem 1. Under Conditions^ and[3J 

£y) V 2 (e x , e y ). 

In particular, when e x and e y are independent, V 2 (e x ,e y ) —>■ 0. 

Theorem [l] shows that the sample distance covariance between the estimated residual vectors con¬ 
verges to the distance covariance between the population error vectors. It enables us to use the distance 
covariance of the estimated residual vectors to construct the conditional independence test as described 
in Section [2721 


Theorem 2. Under Conditions [7]J/J and the null hypothesis that e x _IL e y (or equivalently x i y|f ), 

"V*(e*,e„)4||C|| 2 , 

where ( is a zero-mean Gaussian process defined analogously as in Lemma 

Theorem [2] provides the asymptotic distribution of the test statistic T(x, y, f) under the null hypoth¬ 
esis, which is the basis of Theorem [3] 

Corollary 2. Under the same conditions of Theorem 

OO 

nV 2 { e Xl e y )/ S 2 ( e x , e y ) ->• Q, where Q = ^ A jZ 2 , 

1=1 

where Zj A/"(0,1) and {Aj} are non-negative constants depending on the distribution o/(x, y); E(Q) = 

1 . 


Theorem 3. Consider the test that rejects conditional independence when 

nV 2 (e x ,e y ) > ($ _ 1(1 _ a/2))2> (11) 

02 

where $(•) is the cumulative distribution function ofj\f(0 ,1). Let a„(x,y,f) denote its associated type I 
error. Then under Conditions 00 for all 0 < a < 0.215, 


(i) lim„_ >00 a n (x,y,f) < a , 
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(ii) sup^jl^ lim^oo a„(x, y, f) = a . 

Part (i) of Theorem [3] indicates the proposed test with critical region ( [IT] ) has an asymptotic significance 
error at most a. Part (ii) of Theorem [ 3 ] implies that there exists a pair (e x , e y ) such that the pre-specified 
significant level a is achieved asymptotically. In other words, the size of testing Hq : e x JL e y is a. 


Remark 2. When the sample size n is small, the theoretical critical value in could sometimes he 
too conservative in practice Szekely et al., 2007). Therefore, we recommend using random permutation 
to get a reference distribution for the test statistic T(x,y, f) under Hq. Random permutation is used to 
decouple e, iX and e^ v so that the resulting pair (e n u\ !X , £i, y ) follows the null model, where {7r(l), ..., ir (n)} 
are a random permutation of indices n}. Here, we set the number of permutations R(n ) = 


[200 + 5000/nJ as in Szekely et al. (2007). Consequently, we can also estimate the P-value associated 
with the conditional independence test based on the quantiles of the test statistics over R(n ) random 
permutations. 


4. Monte Carlo Experiments 


In this section, we investigate the performance of P-DCov with five simulation examples. In Example |4.1[ 
we consider a factor model and test the conditional independence between two vectors x and y given their 
common factor f, via P-DCov. In Examples |4.2| we investigate the classical Gaussian graphical model. 


In Example 4.3 we consider the case of general graphical model without the Gaussian assumption. In 
Examples |4.4| and |4.5[ we consider the case of factor based dependency graph and a general graphical 
model with external factors, respectively. 


Example 4.1. [High-dimensional factor model] Let p = 5, q = 10 and I\ = 1000. Assume the rows 
of B x and rows of B y are i.i.d. distributed as zk = (zf ,zTf) T , where Zi is a 3-dimensional vector with 
elements i.i.d. from Unif[ 2,3] and Z 2 = Oks- {f }}" =1 are i.i.d. from A/"(0, Ik). We generate n i.i.d. 
copies {r ;}" =1 from log-normal distribution lnA/"(0, S) where S is an equal correlation matrix of size 
(p + q) x (p + q) with Yljj. = p when j 7 ^ k and E jj = 1 . e^ x and e i y are the centered version of the 
first p coordinates and the last q coordinates of r^. Then, {xj }" =1 and {y,:}" =1 are generated according to 
Xi = B a ,f, + e^ x and y i = B. y fj + y correspondingly. 

In Example |4.1[ we consider a high-dimensional factor model with sparsity structure. Note that 
the errors are generated from a heavy tail distribution to demonstrate the proposed test works beyond 
Gaussian errors. We assume each coordinate of x and y only depends on the first three factors. We 
calculate T(x, y,f) in the P-DCov test, and To(x, y, f) in which we replace e t _ x and e,; j2/ by the true 
and £i^ y as an oracle test to compare with. To get reference distributions of T(x, y, f) and To(x, y, f), we 
follow the permutation procedure as described in Section [3| In this example, we set the significance level 
a = 0.1. We vary the sample size from 100 to 200 with increment of 10 and show the empirical power 
based on 1000 repetitions for both T(x,y,f) and T 0 (x,y,f) in Figure [l] for p £ {0.1, 0.2,0.3, 0.4}. In the 
implementation of penalized least squares in Step 1, we use R package glmnet with the default tuning 
parameter selection method (10-fold cross-validation) and perform least square on the selected variables 
to reduce estimation bias of these estimated parameters. 

From Figure [l] it is clear that as the sample size or p increases, the empirical power also increases in 
general. Also, comparing the panels (A) and (B) in Figure [I] we see that when the sample size is small, 
the P-DCov test has smaller power than the oracle test, however, the difference between them becomes 
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Table 1 . Type I error of Example 1 


Test based on e x and e y 


n 100 

110 

120 

130 

140 

150 

160 

170 

180 

190 

200 

0.129 

0.112 

0.117 

0.095 

0.104 

0.108 

0.098 

0.113 

0.102 

0.117 

0.111 





Test based on e x 

and e y 





n 100 

110 

120 

130 

140 

150 

160 

170 

180 

190 

200 

0.089 

0.092 

0.078 

0.112 

0.104 

0.090 

0.099 

0.113 

0.104 

0.106 

0.096 


Fig-1. Power-sample size graph of Example 1 




(A)T(x,y,f) (B)r„(x,y,f) 


negligible as the sample size increases. This is consistent with our theory regarding the asymptotic 
distribution of the test statistics. When p = 0, Table |T] reports the empirical type I error for both P- 
DCov as well as the oracle version. It is clear that the type I error of P-DCov is under good control as 
the sample size increases. 

Example 4.2. [Gaussian graphical model] We consider a standard Gaussian graphical model with 
precision matrix FI = E _1 , where Ft is a tridiagonal matrix of size d x d, and is associated with the 
autoregressive process of order one. We set d = 30 and the (i,j)-element in E to be aij = exp(— |sj — Sj|), 
where si < S 2 < • • • < Sd- In addition, 

Si — Si -1 Uniform (1,3), i = 2, ..., d. 


In this example, we would like to compare the proposed P-DCov with the state-of-the-art approaches 
for recovering Gaussian graphical models. In terms of recovering structure FI, we compare lasso.dcov 
(projection by LASSO followed by distance covariance), sam.dcov (projection by sparse additive model 
followed by distance covariance), lasso.pearson (projection by LASSO followed by Pearson correlation), 
sam.pearson (projection by sparse additive model followed by Pearson correlation) with three popular 
estimators corresponding to the LASSO, adaptive LASSO and SCAD penalized likelihoods (called graph¬ 


ical.lasso, graphical.alasso and graphical.scad on the graph) for the precision matrix (Friedman et ai, 


2008 Fan et al ., 2009). Here, lasso.dcov and sam.dcov are two examples of our P-DCov methods. We 


use R package SAM to fit the sparse additive model. To evaluate the performances, we construct receiver 
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Fig. 2. ROC curves for Gaussian graphical models 




(A) n = 100 


False positive rate 


(B) n = 300 


operating characteristic (ROC) curves for each method with sample sizes n = 100 and n = 300. The 
process of constructing the ROC curves involves conducting the P-DCov test for each pair of nodes and 
record the corresponding P-values. In each of the ROC curve, true positive rates (TPR) are plotted 
against false positive rates (FPR) at various thresholds of those P-values (“TP” means the true entry of 
the precision matrix is nonzero and estimated as nonzero; “FP” means the true entry of the precision 


matrix is zero but estimated as nonzero.) We follow the implementation in Fan et al. (2009) for the 


three penalized likelihood estimators. The average results over 100 replications of different methods are 
reported in Figure [2] The associated AUC (Area Under the Curve) for each method is also displayed in 
the legend of the figure. 

We observe that lasso.pearson and sam.pearson perform similarly with the penalized likelihood meth¬ 
ods when n = 100. On the other hand, lasso.dcov and sam.dcov lead to slightly smaller AUC value due 
to the use of the distance covariance, which is expected for the Gaussian model. This shows that we do 
not pay a big price for using the more complicated distance covariance and sparse additive model. 

Example 4.3. [A general graphical model] We consider a general graphical model with a combination 
of multivariate t distribution and multivariate Gaussian distribution. The dimension of x is d = 30. In 
detail, x = (x^, x^) T where xi follows a 20 dimensional multivariate t distribution with degrees of freedom 
5, location parameter 0 and identity covariance matrix and x 2 follows the same Gaussian graphical model 
as in Example \4-2\ except the dimension is now 10. In addition, Xi and x 2 are independent. 

To generate a multivariate f-distribution, we first generate a random vector w 2 o from the standard 
multivariate Gaussian distribution and an independent random variable r ~ y 2 (5) and then set xi = 
vf/y/r. One important fact about the multivariate t distribution is that the zero element in the precision 


matrix does not imply conditional independence like the case of Gaussian graphical models (Finegold and 


Drton, 2009). Indeed, for xi, we actually have the fact that x^ and x) j; are dependent given x) 


So) 


S-i,-o) 


for any pair 1 < i j < 20. Consequently, the Gaussian likelihood based methods will falsely claim that 
all the components of Xi are independent. 
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Fig. 3. ROC curves for a general graphical model 



False positive rate 



(A) n = 100 


(B) n = 300 


The average ROC curve results are rendered in Figure [3] As expected, by using the new projection 
based distance covariance method for testing conditional independence, lasso.dcov outperforms all the 
other methods in terms of AUC, with a more evident advantage when n = 300. One interesting obser¬ 
vation is that: in the region where FPR is very low, the likelihood based methods actually outperform 
P-DCov methods. One possible reason is that the likelihood based methods are more capable of capturing 
the conditional dependency structure within x 2 as it follows a Gaussian graphical model. 


Example 4.4. [Factor based dependency graph] We consider a dependency graph with the contribution 
of external factors. In particular, we generate u ~ JV(0,17), where 17 is the same tridiagonal matrix used 
in Example f.2 and f ~ iV(0,1), then the observation x = u + Qg(f) where Q is a sparse coefficient 
matrix that dictates how each dimension of x depends on the factor g(f). In particular, the generation 
of Q follows the setting in Cai et al. (2013). For each element Qij, we first generate a Bernoulli 
distribution with success probability 0.2 to determine whether Qij is 0 or not. If Qij is not 0, we then 
generate Qij ~ Uniform (0.5,1). Here we consider two forms of g(-), namely g( f) = f and g(f) = f 2 . 


We report results regarding the average ROC curves for lasso.pearson, lasso.dcov, sam.pearson and 
sam.dcov. The results for both g(f) = f and g{ f) = f 2 are depicted in Figure [4j Note that we are 
not building a conditional dependency graph among x, but a dependency graph of x conditioning on the 
external factor f. There are some insightful observations from the figure. First of all, by looking at the 
first case when g(f) = f, it is clear that lasso.pearson is the best as it takes advantage of the sparse linear 
structure paired with the Gaussian distribution of the residual. By using the distance covariance as a 
dependency measure, or by using the sparse additive model as a projection method, it is reassuring that 
we do not lose much efficiency. Second, for the case when g(f) = f 2 , we can see a substantial advantage 
of the sparse additive model based methods as it can capture this nonlinear contribution of the factors 
to the dependency structure of x. 


Example 4.5. [A general graphical model with external factors] We consider a general conditional 
dependency graph with the contribution of external factors by combining the ingredients of Examples \4-S\ 
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Fig. 4. ROC curves for factor based dependency graph 




(A) n = 100, g(f) = f 


(B) n = 300, g{{) = f 




(C) n = 100, 5 (f) = f 2 


(D) n = 300, 5 (f) = f 2 


and 4-4 In particular, we generate u from Example \4 -3| and f ~ iV(0, J 30 ), then set x = u + Q 5 (f) where 
Q is the same as Example 4-4 We also consider 5 (f) = f and 5 (f) = f 2 . 


In this example, we would like to investigate the performance of a two-step projection method. In 
particular, we first project x onto the space spanned by f and denote the residual by u. Then we explore 
the conditional dependency structure of and given fif - **^- 7 ') by projecting them onto the space 
orthogonal to the space (linearly or additively) spanned by Here, we compare the performances 

of methods using the external factor and those that ignore them. The average ROC curves are rendered 
in Figure [5] 

From the figure, we see that first of all, when 5 (f) = f, the methods using external factors outperform 
their counterparts without using the information with the best method being lasso.dcov (lasso regression 


















14 J. Fan, Y. Feng, and L. Xia 


Fig. 5. ROC curves for a general graphical model with external factors 



0.0 

0.2 0.4 0.6 0.8 

False positive rate 

1.0 

0.0 

0.2 0.4 0.6 0.8 

False positive rate 

1.0 


(A) n = 100, 5 (f) = f 



(B) n = 300, 5 (f) = f 




! 


i 



(C) n = 100, 5 (f) = f 2 


(D) n = 300, 5 (f) = f 2 


based projection coupled with distance covariance). Second, when we have nonlinear factors, using the 
factors do not necessarily help when we only consider linear projection. For example, the performances 
of lasso.pearson and lasso.pearson.f in panel (c) illustrates this point. On the other hand, by using sparse 
additive model based projection, we have a substantial gain over all the remaining methods especially for 
n = 300. 


5. Real Data Analysis 


5.1. Financial Data 


In the first empirical example, we consider the Fama-French three-factor model (Fama and French, 1993). 
We collect daily excess returns of 90 stocks among the S&P 100 index, which are available between August 
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19, 2004 and August 19, 2005. We chose the starting date as Google’s Initial Public Offering date, and 
consider one year of daily excess returns since then. In particular, we consider the following three-factor 
model 

fit - Tft = A,MKT(MKT t — rft ) + A.smbSMBj + A,HMLHML t + u it , 


for i = 1,..., 90 and t = 1,..., 252. At time t, ru represents the return for stock i, rft is the risk-free 
return rate, and MKT, SMB and HML constitute market, size and value factors. 

We perform P-DCov test with FDR control on all pairs of stocks and study the dependence between 
stocks conditional on the Fama-French three-factors. Under significance level a = 0.01, we found out 
that 15.46% of the pairs of stocks are conditionally dependent given the three factors, which implies 
that the three factors may not be sufficient to explain the dependencies among stocks. As a comparison, 
we also implemented the conditional independence test with the distance covariance based test replaced 
by Pearson correlation based test. It turns out the 9.34% of the pairs are significant under the same 
significance level. This shows the P-DCov test is more powerful in discovering significant conditionally 
dependent pairs than the Pearson correlation test. 

We then investigate the top 5 pairs of stocks that correspond to the largest test statistic values using 
the P-DCov test. They are (BHI, SLB), (CVX, XOM), (HAL, SLB), (COP, CVX), and (BHI, HAL). 
Interestingly, all six stocks involved are closely related to the oil industry. This reveals the high level 
of dependence among oil industry stocks that cannot be well explained by the Fama-French three-factor 
model. In addition, we examine the stock pairs that are conditionally dependent under the P-DCov test 
but not under the Pearson correlation test. The two most significant pairs are (C, USB) and (MRK, 
PFE). The first pair are in the financial industry (Citigroup and U.S. Bancorp) and the second pair 
are pharmaceutical companies (Merck & Co. and Pfizer). This shows that by using the proposed P- 
DCov, some interesting conditional dependence structures could be recovered. This is consistent with 
the findings that the sector correlations are still present even after adjusting Fama-French factors and 10 


industrial factors (Fan et al ., 2016a). 


5.2. Breast Cancer Data 

In this section, we explore the difference in genetic networks between breast cancer patients who achieve 
pathologic Complete Response (pCR) and patients who do not achieve pCR. Achieving pCR is defined 
as no invasive and no in situ residuals left in breast in the surgical specimen. As studied in Kue rer et al.\ 
(1999) and Kaufmann et al. (2006), pCR has predicted long-term outcome in several neoadjuvant studies 
and hence serves as a potential surrogate marker for survival. In this study, we use the normalized gene 
expression data of 130 patients with stages I-III breast cancers analyzed by Hess et al. (2006). Among the 
130 patients, 33 of them achieved pCR (class 1), while the other 97 patients did not achieve pCR (class 
2). To construct the conditional dependence network for each class, we first perform a two-sample f-test 
between the two groups and select the 100 genes with the smallest p-values. Afterwards, we construct 
networks of these 100 selected genes for each class using P-DCov with the FDR control at level a = 0.01. 
Notice, in this case, d = 100 and the corresponding sample sizes in two groups are n\ = 33 and ri 2 = 97 
respectively. 

In networks, the degree of a particular node describes how many edges are connected to this node, 
and the average degree serves as a measure of connectivity of the graphs. In Figure [Gj we summarize 
the distribution of degrees for the genetic networks of class 1 and class 2 respectively. We see that the 
average degree of genetic network for class 1 is 9.7 which is much smaller than the average degree of 
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Fig. 6. Degree distribution of the genetic networks 



0 5 10 15 20 25 30 35 

degree 


(A) Class 1 



degree 

(B) Class 2 


network for class 2, which is 44.88. To look at the networks more closely, we select 7 genes among the 100 
genes and draw the corresponding networks in Figure [7J We see that for Class 1 where pCR is achieved, 
gene MCCC2 is a hub and is connected with three other genes. However, in the network for Class 2, 
gene MCCC2 is disconnected from the other six genes. On the other hand, gene MAPT is isolated in the 
network for Class 1, but is connected with two other genes in class 2. 

These findings imply that the two classes may have very different conditional dependence structures, 
and hence likely to have different precision matrices. As a result, when classification is the target, linear 
discriminant analysis may be too simple to capture the actual decision boundary. 


6. Discussion 

In this work, we proposed a general framework for testing conditional independence via projection and 
showed a new way to create dependency graphs. The current theoretical results assume that contribution 
of factors is sparse linear. How to extend the theory to the case of sparse additive model projection 
would be an interesting future work. Another interesting direction is to use the proposed test to create 
dependency graphs among groups of nodes, which could have applications in genetics. 

An R package pgraph for implementing the proposed methodology is available on CRAN. 

A. Proofs 

Lemma 3. Under Condition [?| we have max^- ||(B X - B x )(f i - f,)|| = O p ( a n ) and Emax^ j ||(B a , — 
B*)(fi - f,)|| =0(a n ). 

PROOF. From Condition |ij it is obvious that rnax, :? || (B^, — Ba,)(fi — fj)|| = O p (a n ). Let U n = 
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Fig. 7. Genetic networks for the two classes based on 7 selected genes 



(A) Class 1 


(B) Class 2 


ma Xij || (B x — — fj-)|| and U n = U n /a n . Then, we have 

/*oo 

E(f7 n ) = / P(C/n > t)dt 

J 0 

pi poo 

= J P(U n > t)dt + J P(U n > t)dt 

/ oo 

Cj~ t dt < oo. 

As a result, the lemma is proved. 

For the remaining proofs, we apply Taylor expansion to || et iX — £j lX \\ at e i,x — c hx and get 

c T - 

+ j?—-—7 (Bj — — fj) = ||ej iX — 6j iX || + Dij x , 


I e i,y e j,v II — II e i,v e j,v I 


l,J ,X 

T 

%j,y 


+ ll„ II (By Bj / )(fj fj) — \\e.i,y £j,y || + DiJ^y, 


“h3,V\ 


( 12 ) 


where Cij iX — A £j,x) T(1 A £j.x) and Cip.y — A i,j,y(£i,y ^t,y)T(l A i,j,y')(£i,y ^j,y)i 
for A itjtX G [0,1] and G [0,1]. 

PROOF of Theorem [TJ Using the Taylor expansion in (12), we have the following decomposition 


Vni^Xi £y) ~ Vni e x, £y) ~ T\ + T 2 + T 3 , 


where 


^ n i n i n ^ n n 

Tl = 53 -Di,j,x||ei,?/ ~ e j,y\\ + ~2 53 53 — ei ,i/|| — U3 53 53 — e k,y II) 

i i It l v lb 

i,j =1 M=1 k,l—l i—1 j,k =1 

(13) 

^ n l n l n 2 U n 

^2 = —2 53 — e i.*n + Z2 53 ^ i d<vz2 53 ii efc ’ x ~ ej . x ii — Z3 5Z 53 _ 


i,i=l 


*lJ=l 


fc,;=l 


i=i j,fc=i 


( 14 ) 
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^ n i n i n ^ n n 

= ififi A,i,xA,j,j/ + ^ X! A,j,x^2 A,j,t/ _ ^3 X! Aj.xA.fc.y 

2 ,j=l * ?i ? = l 2 ,j=l i=l j,fc=l 

By Condition [3j we have max,,, = O p {a n log n). Therefore, 

( 4 n 

— 2 ^ 1 II e i,y — 

U i , j =1 

/ 4 n 

|T 2 | < O p (a n log n) ^ ||e ijSC - €j, a 


( 15 ) 


i,l=l 


|A| < O p ((a n logn) 2 ). 


Another fact we easily observe is that: n 2 J2i ,= 1 


,j=l ll^ 2 ,rc 


= O p (l), since E||e ijX - e jtX || is 


uniformly bounded over all (i,j) pairs and so is E(n 2 YYi ,•= 1 ll e i,a; ' e j, 


As a result, we know V 2 (e x , e y ) — V 2 (e x , e y ) -o- 0. This combined with Lemma [l] leads to 

£ y ) -t V 2 (e x ,e y ). 

Remark: The result of Theorem [l] cannot be implied from that of Theorem [2j since independence 


between e x and e y is not assumed. 


Lemma 4. For the c ij tX and c ij tV defined in (12), we have the following approximation error bound 
on the normalized version. 


^i,j,x 

£-i,x 

e j,x 

ll c *,J,a:|| 

|| ^i,x 

~ e J>ll 

c i,j,y 

e i,y 

— e j,y 

1 \ c i,j,y II 

II e i,y 

~ e j,y\\ 


< 


< 


£ i.. t. 


■ max || (B x - B x )(£j - fy)||. 


^ 2 , X ^-J,X\\ l i0 

ma,x||(B y -B y )(f i -f 7 )||. 


PROOF. It suffices to show (16). First, we will show 


^l,X J,X 




< 


| ^ 2 , x £ 7 , 2 ; || 


\^-i,x € 7 , 2 ; || 


(16) 

(17) 

(18) 


Denote by a\ and 0:2 the angle between c ij >x and £i jX — £j jX , and the angle between Y,x — &j,x and 
e i,x ~ e j,xi respectively. It is easy to see that 0 < cm < a 2 < 7 r, and hence coscm > cosct 2 . By cosine 
formula, 


£%,x ^j,x 




= 2 — 2 cos a \, and 


£%,X £j,x 


£%,X £j,X 


= 2 — 2 cosa 2 . 


Therefore, (18) is proved and it remains to show that 

2 

< - 


^3,x | 


max 


B - - f, 


(19) 


Left hand side of (|19|) can be rewritten as 

^ 2 , X £ j,x 


|| c 2 ,ai C J,X\\ ll c, 2 ,x 
[(^2,x £7,21) (^2,2; £7,2;)] 




,,X £j,2j|| II ^2 ,X ^j,X II ) (^2,2? ^jt x ) 


<- 


1 


^2, X ^j,x) (^2,£ 1 


^-i,X ^"jiX II 11 €-i.nr. €i. 


2,21 ^j,x\\ 


<- 


max 


E i,x ~ e j,x\\ 2 ,jG{l,...,n} 


(B x — B x )(fj — fj)||. 


Combining (18) and (19), the lemma is proved. 
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Lemma 5. Under Conditions and^| and the null hypothesis that e x X e y , for any 7 > 0, 


1 

l ” 1 

1 y^ 1 

4 0 , 

1 

1 ” 1 

1 1 

n 7 log n 

77,2 ijf'i ll e X ~ foxII 

n 7 log n 

n2 l e vy - fo.vll 


PROOF. We will only show the first result involving e x with the other one follows similarly. For any 
S > 0, let 


Rn ~ „2 


1 


r> Rn ~ „2 ^2 


J l|^2,x fo,x 

ij—i > 


1 


A n 


2+5 


i,j —1 


Then for V e > 0, 


P[| R n - Rn | > e] < n z P{\\e t ' X - e jtX \\ < n~ 2 ~ s } < Cn 2 n~ 2 - a = Cn 


2 - 2-8 


,-8 


( 20 ) 


due to the Condition [ 2 ] that the density function of ||— ep x || is pointwise bounded. Therefore, 

— p 

| R n — Rn | —)• 0, which leads to 


Rn 


Rn 


n 7 log n n 7 log n 


0 . 


( 21 ) 


On the other hand, 

- 1 - 

log n II Ci,x - £j ,a 


An 2+ *1 = ^-P(, 


> n 2+s )n 2+s + 


log n \\c iiX - e jiX 


1 


< 


< 


< 


C 


rC 0 


1 


logn 

+ 

logn 

C 


C 

logn 


logn 

C 

+ 

C 


h x {x)dx ■ 


1 


1 , 


logn J n -2—5 t 

OO ^ 


I (t)dt 


log nJ Ca t 
-dx + -P(||e iiX - e j}X \\ > C 0 ) 


( t)dt 


[log (C 0 ) + log(n 2+l5 )] 


1 


C 0 log n 


<^ + C' 


1 


log n Co log n ’ 


( 22 ) 


where h\\ e . x _ £j is the density of ||e iiX — e JiX ||. In the above derivation, the first inequality can be easily 
seen from (201 and the second inequality utilizes Condition [2] 

Therefore, R n /\ogn is bounded in L\ and since n 7 —> 00 , -R n /[n 7 log(n)] converges to 0 in L 1 and 
hence in probability, i.e., 


Rn 


n' 7 log(n) 


4 0. 


(23) 


This, combined with (211 yields 


This completes the proof of Lemma [bj 


Rn 


n 7 log(n) 


4 0. 


( 24 ) 


To prove Theorem [2j we first introduce two propositions. 

Proposition 1. Under Conditions Q] and^J and the null hypothesis that e x X e y , 


Ti = O p (a n /n ), T 2 = O p (a n /n) 
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PROOF. From (| 13 |), we rewrite T\ as 

1 CA „ / „ 1 


Tl ~ „2 


n i n i n 

55 ll 6fc .y — e byll — F. 55 H e *>y — e fe,yll — “ 55 II 6 j'i!/ — £k ,y 


i,j =1 


fc,Z = l 


fc=1 


fc=l 


— „2 55 Di,j, x Ai t j t y, 


i,j =1 

with Aij^y self-defined by the equation. 

Let us consider term 

We can separate the above quantity into three parts. It is easy to see that D, hx are identically 
distributed with respect to different pairs of (i. j) when i ^ j. Let us define the following three sets of 
index quadruples: 

• I\ = k, Z)|there are four distinct values in {i,j, k,l}}. 

• I 2 = k, l)\i 7 ^ j , k 7 ^ l, and there are three distinct values in {*, j, k , /}}. 

• I3 = k, l)\i 7^ j , k 7^ l, and there are two distinct values in {*, j, k , l}}. 

Let us suppose K(Dij iX Dk t i tX ) = ci, for ( i,j,k,l) £ h; E(Aj,*%,*) = c 2 , for ( i,j,k,l ) £ / 2 . 
^(Dij, x Dk,i,x) = C 3 , for ( i,j,k,l ) £ 13 . By Condition^ we know ci, c 2 and C 3 are all of order O(a^). 
Also, E(Ai iJi j / ) = 0(1). Then we have 


) — ® ( 4 55 Ai,j,yAk,l t y + 4 E] -^i,j,yAk,l,y + 4 ^5 Ai t j t yAk,l,y 

\ n h n 12 Is 


( 26 ) 


On the other hand, we observe that y~)”_ 4 j y = 0 by definition and A it j y = Aj iy , so we have 

n n n n n n 

E A^yA k ^ y = E(E ^,») 2 -EE l E .v = - E E 'E- 

/ 2 i=l j=l i=l j = 1 i=l j=l 


l e *,y e i,yll > ll e i,y 


By Condition [l] we know all the second order terms of distances of differences 
£j, y \\ ' II e i,y ~~ e k, y \\ as examples) have bounded expectation, and thus all the second order terms of 
Aij >y ’s also have bounded expectations. Therefore, E(n -4 Aij iV Ak,i, y ) = 0{n~ 2 ). Finally, since 

e-=ie;=i%v=o, 

n n n 

55 Ai,j,yAk,l,y =(E 55 Ai,j,y) ~ J5 Ai,j,yAk,l,y — ^5 Ai,j,yAk,l,y — ^5 ^i,i,y 
h i= 1 j =1 J 2 J 3 i=l 

n 

= — 55 Ai,j,yAk,l,y — E Ai,j,yAk,l,y — E ^ 

h I 3 i =1 

This combined with our previous calculations leads to E(n _4 ^ 7i Aij tV Ak,i, y ) = 0(n~ 2 ). As a result, 
we have E(T^) = 0(a 2 /n 2 ). Together with Chebychev’s inequality, we know T-j 2 = O p (a^ l /n 2 ) and 
equivalently, Tf = O p (a n /n). Similarly, we could show that T 2 = O p {a n /n). 

Proposition 2. Under Conditions [7j Q [3| and[7J and the null hypothesis that e x A e y , 

( a 2 n log A' 


T 3 = O py 


-)■ 
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Proof. Recall that 

^ n i n i n ^ n n 

= ^2 E ^hj,xDi,j t y + E E D'-rnv ~ ^3 X] X] Di,j, x D itkt y 

*,j=i 

1 n 

= ,,,2 


i,j =1 


*li=l 


j,k=l 


i,j=l 


with Bi j y self-defined in the above equation. We can easily see that Y^i=i Bi t j, y = 0, for any /. Let 
Umax = maxjj |Sjj-,y|, then we define Bi t j tV = Bij iV /(2B niax ) + 0.5. In this way, we know Bij tV £ [0,1] 
and £r=i B l : j y = 1/2 for any j. By Condition [ 3 } we know that B m ax = O p (a n ). 

Then we can rewrite T 3 in the following form: 


T .3 = 


2 / 1 ,, 


E a,. 

» j=i 




B 


max ^ ^ t-> . rji rp 

/ J u i,j,x — -^31 — -* 32 - 

i,j =1 


Let us look at T 31 first. If we denote D and B as the matrix of dimension nx n composed of elements 


Di j x and B t ,j y , we know that 


IT31 1 < 


2 B„ 


|D|| F ||Bj| F = Op(a n /n 2 )Op(a n n)O p (y/n) = O p {^=). 


(27) 


n~ . ' v II 

Then, let us proceed to term T 32 . Here, we write Di j X in another form as a sum of two terms and bound 
them separately. 

€j,x) }( B * - § *x f * - f /)+( 


Bi,j,x 


c l,X J,X\ 


£-l,X ^-J-,X I 


(Ba; - B rc )(f \ - fj) 


T 


_ (^i,x €j,x ) 

II e i,x ~ e j,x\\ 

As a result, we know 

B n 


(B x - - f )) + r itj 


E r id 


Bmax E ( ,E ^ (B^B,)^-!,). 


(28) 


By Lemma |4j we know 


< max || (Bj, - B x )(fj - f, 


^ i,x ^J,x\ 


where max^ ||(B X - B x )(f, - f,)|| 2 = O p (a 2 n ). 

So the first term in T 32 has rate n~ 2 i? max £” . =1 rij }X = O p (a^ (logn)n 7 ). 
The second term in T 32 can be rewritten in terms of trace: 


/i„ 


n ( _ \T 

E ,fr~ 6 r,, (Ba—B x )(fj ~ fj) 


^ -i ^ 2 , 2 ; 

2 ,J = 1 " ’ • y ’ 


(29) 


B max Tr ( (B. r - B c ) — ^2 ( f i “ f i) 


*»i=i 


(^i,x ^j,x) 


^i,x ^J,x\ 


c)w) 


E B max E ll-Bx.J - E*lli max|W(*, j)|, 

z 2,7 

;=i 


(30) 


where W is self-defined and W(i,j) is the element on the j-th row and /-column of matrix W. Let us 
take (i,j) = ( 1 , 1 ) as an example, and look at W(l, 1 ) = ^2 £" J= i(/i,i — fj,i) liE-E’ll ‘ We eas ^y see 
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that EW(1,1) = 0, due to facts: and e^ x are mutually independent of f with any observation indices; 

and E[(e ija; - e^ x )/\\e i>x - e jtX \\] = 0. Furthermore, 


E(W(1, l) 2 ) = — £ 


€i,x,l €j,x,l 


i,j,k,l =1 


\ti.x € 


(/m — fl, l) 


0,x\ 


€k,x,l €l,x,l 
|| €-k,x €l,x || 


Similar to the reasoning in Proposition [TJ we have n 4 terms in I\. But in this scenario, E(/,; i — 


/mV',:' 


f'*’ 1 = 0 due to independence, therefore we know 


E(VF(1, l) 2 ) = 0(l/n). 

As a result, we know |W(1,1)| = O p (n -1 / 2 ), and thus maxjj \W{i,j)\ = O p {n~ 1 / 2 \ogK). Furthermore, 
we can bound the term in fl30| with rate O p (n~ 1 ^ 2 a n e n log A'). 

Combining T 31 and T 32 , we know T 3 = O p {(a^(logn)n 7 ) V (n _ 1 / 2 a n e„ log A")} and the proposition is 
proved by looking at Conditions [3] and |4j 


Proof of Theorem [2] Recall the notations we used in the proof of Theorem [l] 


Vn( e xi e y) 


= T 1 +T 2 +T 3 . 


By Propositions [l] and [2j we have for any 7 > 0, 


n{T 1 +T 2 + T 3 ) = O p (a n ) + O p {{n 1+1 (log n)a 3 n ) V {a n e n log A'y^)}. 


Combined with Lemma [2j the theorem is proved. 

Proof of Corollary [2] The result follows directly from the proofs of Theorems [I] and [2] and an 
application of Slutsky’s theorem. 

PROOF of Theorem [3j The proof of Theorem [3] follows similarly as Theorem 6 in |Szekely et al. 
(2007). Here we omit the details for brevity. 
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